STAT 341/641: Intro to EDA and Statistical Computing
Lab #1: Ggplot
Teaching Assistant: “Yanjun Liu”


Directions: The following contains tasks you must complete to receive full credit for this homework. Consult the R Markdown cheatsheet on Canvas if you have questions about markdown syntax.


#Task One: Chapter 3 of R for Data Science

Navigate to https://r4ds.had.co.nz/. You will work through sections 3.1 to 3.10 in this laboratory. Replicate each computation performed in the chapter and answer the associated questions.

##3.1: Introduction.

Solution: (Write your code in the following block. You can add additional blocks in order to write text between the blocks.)

r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("tidyverse")
## Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
##   cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
## Warning: package 'tidyverse' is not available (for R version 3.5.1)
## Warning: unable to access index for repository http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5:
##   cannot open URL 'http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5/PACKAGES'
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'tidyr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## Warning: package 'stringr' was built under R version 3.5.2
## Warning: package 'forcats' was built under R version 3.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

##3.2: First steps

Solution:

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
?mpg
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy))

ggplot(data = mpg) + geom_point(mapping = aes(x = drv,y = class))

# Exercises ***3.2.4***
# Question 1) ggplot(data = mpg) draws an empty plot: with no geom layer, no data is displayed
# Question 2) 234 rows and 11 columns
# Question 3) drv is the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
# Question 4)
    ggplot(data = mpg) + geom_point(mapping = aes(x = cyl,y = hwy))

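# Question 5) The plot is not useful because it compares two categorical variables: many cars share the same drv and class values, so their points are drawn on top of each other and the plot hides how many observations each combination has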
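
To make the answer to Question 5 concrete, here is a minimal sketch that counts how many distinct positions the drv versus class scatterplot can show and then redraws it with geom_count(), which sizes each point by the number of observations it covers. The names distinct_positions, total_rows and p_counts are illustrative and do not come from the chapter.

```r
library(ggplot2)

# Each row of mpg becomes one point, but rows that share the same drv and
# class are drawn on top of each other.
distinct_positions <- nrow(unique(mpg[, c("drv", "class")]))
total_rows <- nrow(mpg)

# geom_count() draws one point per (drv, class) combination and maps the
# number of observations at that position to point size.
p_counts <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = drv, y = class))
p_counts
```

Comparing distinct_positions with total_rows shows how much of the data the plain scatterplot hides.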

##3.3: Aesthetic mappings

Solution:

# Mapping class to color
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy,color = class))

#Class to Size 
ggplot(data = mpg) + geom_point(mapping = aes(x = displ,y = hwy,size = class ))
## Warning: Using size for a discrete variable is not advised.

#Class to alpha aesthetic
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy,alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

#Class to Shape
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy,shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

# ***Exercises 3.3.1***
# Question 1) Because color = "blue" is inside aes(), it is treated as a mapping to a constant variable instead of a manual setting; to make the points blue it should be placed outside the aes() parentheses
# Question 2) Continuous: displ, year, cyl, cty, hwy; Categorical: manufacturer, model, trans, drv, fl, class
# Question 3) 
      ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y= hwy, color=cyl)) # A continuous variable gets a color gradient (shades of one hue) instead of distinct colors

      ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y= hwy, size=cyl))

# Question 4) Mapping a single variable to multiple aesthetics works, but it is redundant and usually bad practice
# Question 5) Stroke changes the width of the border for shapes that have one (shapes 21-25).
# Question 6) This works: displ < 5 is evaluated for each row, and the points are colored by whether the result is TRUE or FALSE:
        ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y= hwy, colour= displ < 5))
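
The difference between mapping and setting an aesthetic (Question 1) can be checked directly. The sketch below builds both versions and inspects the colors ggplot2 would draw, using layer_data(); the object names p_mapped, p_set, mapped_colours and set_colours are illustrative only.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a one-value variable and passed through
# the default color scale, so the points are not blue and a legend appears.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): the color is set manually for every point.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 actually uses to draw a layer.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)
```

Printing mapped_colours and set_colours shows that only the second plot uses the literal color "blue".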

##3.4: Common problems

Solution:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

##3.5: Facets

Solution:

# Playing around with facets
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(. ~ cyl)

# *** Exercises 3.5.1 ***
# Question 1) The continuous variable is converted to a categorical variable, and one facet is drawn for each distinct value
# Question 2) The empty cells in this plot are combinations of drv and cyl that have no observations
# Question 3) The symbol '.' means no faceting along that dimension: drv ~ . facets by rows only, and . ~ cyl facets by columns only.
  ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

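# Question 4) Advantages of faceting: it can encode more distinct categories than color (only about nine colors are easy to tell apart), and points from different categories no longer overlap.
#       Disadvantages: values are harder to compare across categories because they sit in separate panels.
# Question 5) nrow = number of rows; ncol = number of columns. facet_grid() does not have them because its rows and columns are determined by the unique values of the faceting variables
#Question 6) Plots are usually wider than they are tall, so there is more room for many columns than for many rows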
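
As a quick check of the answer to Question 2, the following sketch cross-tabulates drv and cyl with base R. The names combo_counts and empty_cells are illustrative; any cell with a count of zero corresponds to an empty facet in facet_grid(drv ~ cyl).

```r
library(ggplot2)

# Cross-tabulate drv and cyl: each cell is the number of cars that would
# appear in the corresponding facet of facet_grid(drv ~ cyl).
combo_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
combo_counts

# Cells with a count of zero correspond to the empty facets.
empty_cells <- sum(combo_counts == 0)
empty_cells
```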

##3.6: Geometric objects

Solution:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))  # Displaying a smooth line plot
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) #Adding linetype
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv)) # Group
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) + geom_smooth( mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Plotting and manipulating geom point + geom smooth
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# ***Exercises 3.6.1***
# Question 1 ) line chart: geom_line()
# boxplot: geom_boxplot()
# histogram: geom_histogram()
# area chart: geom_area()
# Question 2) This code produces a scatterplot with displ on the x-axis and hwy on the y-axis, with points colored by drv, plus one smooth line per drv value drawn without standard error bands.
# Question 3) show.legend = FALSE hides the legend. Earlier in the chapter it was used so that the three plots shown side by side kept the same size; a legend on only the last plot would have made it narrower and harder to compare.
# Question 4) The se argument controls whether standard error bands are drawn around the lines
# Question 5) No, because both plots use the same data and mappings. So they will use the same options.
# Question 6) 
  ggplot(data = mpg,mapping = aes(x = displ, y = hwy)) + geom_point()+ geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(group = drv), se = FALSE) + geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +geom_point(aes(colour = drv)) +geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(colour = drv))
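
The answer to Question 5 can be verified instead of taken on trust. This sketch, with the illustrative names p_global, p_local, same_points and same_smooth, builds both versions of the plot and compares the data computed for each layer.

```r
library(ggplot2)

# Version 1: data and mappings declared once in ggplot().
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Version 2: the same data and mappings repeated in each layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare what ggplot2 computes for layer 1 (points) and layer 2 (smooth).
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
```

Declaring the mappings once in ggplot() is shorter and avoids the duplication, which is why the chapter prefers it.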

##3.7: Statistical transformations

Solution:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

# Overriding data to counts
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

# Proportion
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))

# Stat Summary
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

# *** Exercises 3.7.1 ***
# Question 1) Point range plot; geom_pointrange(...)
# Question 2) geom_col() has the default stat of stat_identity(); geom_bar() has the default stat of stat_count()
# Question 3) Paired geoms and stats usually share a name (e.g. geom_boxplot() and stat_boxplot()), and each has the other as its default.
# Question 4) y: predicted value; ymin: lower bound of the confidence interval; ymax: upper bound of the confidence interval; se: standard error
# Question 5) Without group = 1, geom_bar() computes proportions within each value of x, so every proportion is 1 and all bars have the same height
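
The effect of group = 1 (Question 5) is visible in the values that stat_count() computes. The sketch below reads the computed count and prop columns with layer_data(); d_default and d_single are illustrative names.

```r
library(ggplot2)

# Default grouping: each value of cut forms its own group, so the
# proportion within every group is 1.
d_default <- layer_data(
  ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
)

# group = 1: all diamonds form a single group, so the proportions are
# relative to the whole data set.
d_single <- layer_data(
  ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, group = 1))
)

d_default[, c("x", "count", "prop")]
d_single[, c("x", "count", "prop")]
```

The count column is the same in both cases; only prop changes with the grouping.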

##3.8: Position adjustments

Solution:

# Coloring in bar graphs
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

# Position adjustment
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# *** Exercises 3.8.1 ***
# Question 1) The plot suffers from overplotting: many points share the same cty and hwy values. Adding position = "jitter" fixes this
# Question 2) width and height
# Question 3) geom_jitter() adds random variation to the point locations to reduce overplotting, at the cost of slightly changing the x and y values shown. geom_count() sizes the points by the number of observations at each location; it leaves the locations unchanged, but large points can overlap each other when they are close together.
# Question 4) The default position adjustment for geom_boxplot() is "dodge2":
  ggplot(data = mpg, aes(x = drv, y = hwy)) +  geom_boxplot()
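
To illustrate Question 2, here is a small sketch of how width and height bound the jitter. The values 0.25 and the names base_plot, no_jitter, some_jitter and max_shift_x are assumptions chosen for the example; set.seed() is only there to make the random shifts repeatable.

```r
library(ggplot2)

set.seed(341)  # arbitrary seed so the random jitter is repeatable

base_plot <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy))

# width = 0 and height = 0 switch the jitter off entirely.
no_jitter <- layer_data(base_plot + geom_jitter(width = 0, height = 0))

# Each point is moved by at most 0.25 units horizontally and vertically.
some_jitter <- layer_data(base_plot + geom_jitter(width = 0.25, height = 0.25))

# Largest horizontal shift applied to any point.
max_shift_x <- max(abs(some_jitter$x - mpg$cty))
max_shift_x
```

Smaller values keep the points closer to their true positions; larger values separate overlapping points more but distort the data further.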

##3.9: Coordinate systems

Solution:

# coord_flip
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

#coord_quickmap
install.packages("maps")
## Warning: unable to access index for repository http://cran.us.r-project.org/src/contrib:
##   cannot open URL 'http://cran.us.r-project.org/src/contrib/PACKAGES'
## Warning: package 'maps' is not available (for R version 3.5.1)
## Warning: unable to access index for repository http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5:
##   cannot open URL 'http://cran.us.r-project.org/bin/macosx/el-capitan/contrib/3.5/PACKAGES'
nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +coord_quickmap()

# Polar coordinates
bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

# *** Exercise 3.9.1 ***
# Question 1) 
  ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

# Question 2) The labs() function adds axis titles, plot titles, and captions.
#Question 3) coord_map() uses a map projection to project the three-dimensional Earth onto a two-dimensional plane; coord_quickmap() uses an approximate but faster projection
# Question 4) coord_fixed() makes sure that the line produced by geom_abline() is at a 45 degree angle, which makes it easy to compare highway and city mileage for each car
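
The plot discussed in Question 4 is not reproduced above, so here is a minimal sketch of it. The summary value share_above is an illustrative addition: it is the fraction of cars whose highway mileage exceeds their city mileage, that is, the fraction of points lying above the reference line.

```r
library(ggplot2)

p_fixed <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +  # reference line y = x (default intercept 0, slope 1)
  coord_fixed()    # one unit on x is drawn as long as one unit on y
p_fixed

# Share of points above the reference line.
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```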

##3.10: The layered grammar of graphics

Solution:

# No code

#Task Two: The UFO data

Read in the UFO data from Canvas. Use ggplot2 and any other commands you know in R to answer the following question: “Is the distribution of UFO shapes similar in American states and Canadian provinces that share a border?”

You may find the text ggplot2: Elegant Graphics for Data Analysis to be helpful.

library(ggplot2)
library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
ufos <- read_csv("ufos_clean.csv")
## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Month = col_double(),
##   Day = col_double(),
##   Hour = col_double(),
##   Minute = col_double(),
##   State = col_character(),
##   Shape = col_character(),
##   Duration_minutes = col_double()
## )
#Remove blank values for shape

ufos <- subset(ufos, !(Shape == "" ))

# Subset for UFOs seen in American States

ufosUSA <- subset(ufos, !(State %in% c("AB","BC","MB","NB","NF","NS","ON","QC","SK")))
head(ufosUSA)
## # A tibble: 6 x 8
##    Year Month   Day  Hour Minute State Shape     Duration_minutes
##   <dbl> <dbl> <dbl> <dbl>  <dbl> <chr> <chr>                <dbl>
## 1  2017     4    19    23     29 CO    Light                   60
## 2  2017     4    19     0     40 AR    Light                  120
## 3  2017     4    18    14     30 NY    Teardrop               240
## 4  2017     4    16    21      0 UT    Circle                 240
## 5  2017     4    16    20      0 IL    Formation              120
## 6  2017     4    14    22     30 IN    Light                   60
# Subset for UFOs seen in Canadian Provinces that share a border

ufosCAUSA <- subset(ufos, State %in% c("BC","AB","SK","MB","ON","QC","NB"))

# Data Visualization

USA <- ggplot(data = ufosUSA) +
  geom_bar(mapping = aes(x = Shape, fill = Shape)) +
  ggtitle("American State UFO sighting") +
  theme(plot.title = element_text(size = 10)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.title = element_blank(),
    plot.margin = unit(c(0.8, 0.8, 0.5, 0.8), "cm"),
    legend.key.size = unit(0.1, "cm")
  )

CanadaBorder <- ggplot(data = ufosCAUSA) +
  geom_bar(mapping = aes(x = Shape, fill = Shape)) +
  ggtitle("Canadian provinces bordering US States") +
  theme(plot.title = element_text(size = 10)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    legend.title = element_blank(),
    plot.margin = unit(c(0.8, 0.8, 0.5, 0.8), "cm"),
    legend.key.size = unit(0.1, "cm")
  )

margin = theme(plot.margin = unit(c(2,2,2,2), "cm"))
grid.arrange(USA,CanadaBorder)

# Here we can see that the distributions of UFO shapes in the US states and in the bordering Canadian provinces are very similar. This might mean either that A) people in the US are weird, or B) the closer you get to the US, the closer the results get. Both plots show that the most common shape is Light.
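
Because the two groups contain very different numbers of sightings, comparing proportions can be easier than comparing raw counts. The sketch below shows one way to do that. It is not the analysis used above: the toy data frame is made up purely for illustration (it is not the contents of ufos_clean.csv), and the names toy, canadian, Region, shape_share and p_share are hypothetical.

```r
library(ggplot2)

# Made-up toy data standing in for the real file; values are illustrative only.
toy <- data.frame(
  State = c("NY", "NY", "WA", "MI", "ON", "ON", "BC", "QC"),
  Shape = c("Light", "Circle", "Light", "Triangle",
            "Light", "Light", "Circle", "Triangle"),
  stringsAsFactors = FALSE
)

# Same province codes as the bordering-provinces subset above.
canadian <- c("BC", "AB", "SK", "MB", "ON", "QC", "NB")
toy$Region <- ifelse(toy$State %in% canadian,
                     "Canadian provinces", "American states")

# Share of each shape within each region (every row sums to 1).
shape_share <- prop.table(table(toy$Region, toy$Shape), margin = 1)
shape_share

# position = "fill" stacks the bars to a common height of 1, so the two
# regions can be compared regardless of how many sightings each has.
p_share <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = Region, fill = Shape), position = "fill")
p_share
```

With the real data, the same steps would be applied to ufos after adding a Region column in the same way.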